Data preparation is the very first thing you do, and spend a lot of time on, as a data analyst — long before you try to build predictive models with that data.
In essence, data preparation is about processing data to get it ready for all kinds of analysis. In industry, data collection is mostly driven by business processes, not by the needs of predictive models, and these processes introduce errors into the data at various points.
There are also many reasons, besides errors, for which we would need to pre-process our data and change it for the better.
Let's begin by importing all the relevant libraries and packages.
library(dplyr)
library(psych)
library(vcd)
We are going to import the data file bank-full.csv. Let's begin by passing just the file name and letting all other options take their defaults. We'll change some as we come across issues with the imported data.
bd = read.csv("https://raw.githubusercontent.com/insaid2018/R/master/Data/bank-full.csv")
head(bd,2)
You can see that we have been fooled by the file extension into assuming that the separator is a comma, whereas in reality it is ";". Let's tell R that by using the option sep.
bd = read.csv("https://raw.githubusercontent.com/insaid2018/R/master/Data/bank-full.csv", sep=";")
head(bd,2)
Okay, this looks better. Now, let's look at our data.
glimpse(bd)
Observations: 45,211
Variables: 17
$ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, 51, 45, 57, 60, ...
$ job       <fct> management, technician, entrepreneur, blue-collar, unknown, management, mana...
$ marital   <fct> married, single, married, married, single, married, single, divorced, marrie...
$ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, tertiary, tertia...
$ default   <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no, no, no, no, no,...
$ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71, 162, 229, 13, ...
$ housing   <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ye...
$ loan      <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no, no, no, no, no...
$ contact   <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknown, unknown, unkn...
$ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
$ month     <fct> may, may, may, may, may, may, may, may, may, may, may, may, may, may, may, m...
$ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517, 71, 174, 353, 9...
$ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ...
$ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ poutcome  <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknown, unknown, unkn...
$ y         <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, ...
We can also compute descriptive statistics using the describe function from the psych package.
describe(bd)
Let's look at the six-number summary of our data using the summary function.
summary(bd)
age job marital education default
Min. :18.00 blue-collar:9732 divorced: 5207 primary : 6851 no :44396
1st Qu.:33.00 management :9458 married :27214 secondary:23202 yes: 815
Median :39.00 technician :7597 single :12790 tertiary :13301
Mean :40.94 admin. :5171 unknown : 1857
3rd Qu.:48.00 services :4154
Max. :95.00 retired :2264
(Other) :6835
balance housing loan contact day month
Min. : -8019 no :20081 no :37967 cellular :29285 Min. : 1.00 may :13766
1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906 1st Qu.: 8.00 jul : 6895
Median : 448 unknown :13020 Median :16.00 aug : 6247
Mean : 1362 Mean :15.81 jun : 5341
3rd Qu.: 1428 3rd Qu.:21.00 nov : 3970
Max. :102127 Max. :31.00 apr : 2932
(Other): 6060
duration campaign pdays previous poutcome y
Min. : 0.0 Min. : 1.000 Min. : -1.0 Min. : 0.0000 failure: 4901 no :39922
1st Qu.: 103.0 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840 yes: 5289
Median : 180.0 Median : 2.000 Median : -1.0 Median : 0.0000 success: 1511
Mean : 258.2 Mean : 2.764 Mean : 40.2 Mean : 0.5803 unknown:36959
3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
Max. :4918.0 Max. :63.000 Max. :871.0 Max. :275.0000
You can see that all of our character columns have been stored as factors. We'd rather avoid this, and we can do so by using the option stringsAsFactors.
bd=read.csv("https://raw.githubusercontent.com/insaid2018/R/master/Data/bank-full.csv", sep=";",stringsAsFactors = FALSE)
glimpse(bd)
Observations: 45,211
Variables: 17
$ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, 51, 45, 57, 60, ...
$ job       <chr> "management", "technician", "entrepreneur", "blue-collar", "unknown", "manag...
$ marital   <chr> "married", "single", "married", "married", "single", "married", "single", "d...
$ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", "tertiary", "ter...
$ default   <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no", "no", "no", "no...
$ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71, 162, 229, 13, ...
$ housing   <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", ...
$ loan      <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no", "n...
$ contact   <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unknown", "unknown",...
$ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
$ month     <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may", "may", "may",...
$ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517, 71, 174, 353, 9...
$ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ...
$ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unknown", "unknown",...
$ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no"...
Next, let's look at what values the variable job takes.
table(bd$job)
admin. blue-collar entrepreneur housemaid management retired self-employed
5171 9732 1487 1240 9458 2264 1579
services student technician unemployed unknown
4154 938 7597 1303 288
There are 288 observations where the value is unknown. If you want, you can set it to missing using the option na.strings. But remember, this will set the value unknown to missing for all the columns. If you want to do it for only one of the columns, do it after you have imported the data.
bd=read.csv("https://raw.githubusercontent.com/insaid2018/R/master/Data/bank-full.csv", sep=";", stringsAsFactors = FALSE, na.strings = "unknown")
sum(is.na(bd$job))
You can see that the column job now has 288 missing values. This was to show you how to use the option na.strings. In general it is not good practice to set an arbitrary value as missing; so it's alright for practice, but don't set unknown to missing unless you have a good reason to do so. In fact, in many cases with categorical variables, unknown itself can be taken as a valid category, as you'll realise later.
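If you do want unknown treated as missing in just one column, here is a hedged sketch of the post-import approach; a toy data frame stands in for bd so the snippet runs standalone.

```r
# toy stand-in for the imported data: "unknown" appears in two columns
toy = data.frame(job     = c("management", "unknown", "technician"),
                 contact = c("cellular", "unknown", "telephone"),
                 stringsAsFactors = FALSE)

# set "unknown" to NA in job only; contact keeps unknown as a valid category
toy$job[toy$job == "unknown"] = NA

sum(is.na(toy$job))      # 1
sum(is.na(toy$contact))  # 0
```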
We don't need to change the default values of the other options for this import. The same will be true for most of the data you work with; if it isn't, feel free to use any of the options described above.
We'll start with a discussion of the apply family of functions, which are a very handy way to summarise, and perform other operations collectively on, your data. Let's say we want the means of all columns in the data set mtcars. We could achieve that by writing a for loop over the columns with the function mean.
for(i in 1:ncol(mtcars)){
  print(mean(mtcars[,i]))
}
[1] 20.09062
[1] 6.1875
[1] 230.7219
[1] 146.6875
[1] 3.596563
[1] 3.21725
[1] 17.84875
[1] 0.4375
[1] 0.40625
[1] 3.6875
[1] 2.8125
Fine, you get the result, but it's not in a very convenient format and the code is not pretty. What if there were a function that let us do this without writing loops and managing the iteration ourselves? This is such a common scenario in data processing that R has a family of dedicated functions for it: lapply, sapply, apply, and tapply.
x = round(rnorm(10), 2)
x
[1] 0.07 0.34 -1.99 1.03 -0.87 1.44 0.58 -0.23 1.25 -0.48
We'll first talk about lapply. It lets you apply a function repeatedly over a list/vector; the outcome is a list of results. Let's see an example.
lapply(x, log10)
Warning messages: NaNs produced
[[1]]
[1] -1.154902
[[2]]
[1] -0.4685211
[[3]]
[1] NaN
[[4]]
[1] 0.01283722
[[5]]
[1] NaN
[[6]]
[1] 0.1583625
[[7]]
[1] -0.236572
[[8]]
[1] NaN
[[9]]
[1] 0.09691001
[[10]]
[1] NaN
But the above operation can be easily achieved using vector operations. In fact, a simple log10(x) will give you the same result, and in a much more usable format as well. So what good is this lapply thing?
Let's put it to better use, with more options. Suppose you had a lot of text files in a folder and wanted to import them all. One solution would be to write one read.csv call per file, or perhaps run a loop. Better, you could pass all the file names to read.csv using lapply.
# Before running this code, set your working directory to the folder containing the files
file_names = list.files(getwd(), pattern = "\\.TXT$")
files = lapply(file_names, read.csv, header = F, stringsAsFactors = F)
See, we can pass the other common options of read.csv through lapply. The file names in the object file_names are passed one by one to read.csv. The output, files, is simply a list of data frames. If you want to combine them, you can do so using the function do.call.
file = do.call(rbind, files)
do.call passes all the elements of its second argument as arguments to the function named in its first argument.
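As a minimal, self-contained sketch of the same pattern, with two in-memory data frames standing in for files read from disk:

```r
# two data frames standing in for files read with read.csv
files = list(data.frame(id = 1:2, value = c("a", "b")),
             data.frame(id = 3:4, value = c("c", "d")))

# equivalent to rbind(files[[1]], files[[2]]), however long the list is
combined = do.call(rbind, files)
nrow(combined)  # 4
```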
In just three simple lines of code, we have read data from 50+ files and combined it into one. In just three lines, all thanks to lapply! At times, though, output in the form of a list becomes difficult to handle.
You can use sapply in those cases. sapply works exactly like lapply; the only difference is that it tries to simplify the output of lapply to a vector or matrix where possible.
sapply(x, log)
Warning messages: NaNs produced
[1] -2.6592600 -1.0788097 NaN 0.0295588 NaN 0.3646431 -0.5447272 NaN
[9] 0.2231436 NaN
Coming back to our first problem of getting the mean of every column: yes, you can use lapply, because a data frame is nothing but a list of vectors. But there is another function in the apply family, named simply apply, which gives better-formatted output and a little more functionality when you want to apply a function iteratively over the elements of a data frame.
apply(mtcars, 2, mean)
mpg cyl disp hp drat wt qsec vs am
20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750 0.437500 0.406250
gear carb
3.687500 2.812500
The first argument to apply is the name of the data frame, and the third is the function to be applied. The second argument takes one of two values: 1 or 2. 1 stands for rows and 2 for columns; if you put 2 there, the function in the third argument is applied to the columns of the data frame, and if you put 1, it is applied to the rows.
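For instance, passing 1 makes the function see one row at a time. A quick sketch, reducing each row of mtcars to its largest value:

```r
# MARGIN = 1: max() receives each row of mtcars as a numeric vector
row_max = apply(mtcars, 1, max)

length(row_max)       # one value per row: 32
row_max["Mazda RX4"]  # 160 (its disp value dominates the row)
```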
The functions you pass to apply are not limited to pre-existing functions in R; you can write your own and pass them to the apply family. Let's write a function which returns the upper limit for outliers, given a variable column.
outlier_upper = function(x){
  m = mean(x)
  s = sd(x)
  return(m + 3*s)
}
apply(mtcars, 2, outlier_upper)
mpg cyl disp hp drat wt qsec vs am
38.171469 11.545265 602.537956 352.376105 5.200599 6.152622 23.209580 1.949548 1.903223
gear carb
5.900912 7.658100
What you need to remember here is what goes as input to the function: in the case of apply, the input is an entire row or column. Let's use apply to find out how many outliers each column has according to the function outlier_upper.
apply(mtcars, 2, function(x) sum(x > outlier_upper(x)))
mpg cyl disp hp drat wt qsec vs am gear carb
0 0 0 0 0 0 0 0 0 0 1
What if you want a group-wise summary of a variable? tapply comes to your rescue. Its first argument is the column we want to summarise, the second is the grouping variable, and the third is the function to apply to each group.
tapply(mtcars$mpg, mtcars$am, mean)
0 1
17.14737 24.39231
For a group-wise summary of all the variables in the dataset mtcars, you can use a combination of apply and tapply.
apply(mtcars, 2, function(x) tapply(x, mtcars$am, mean))
mpg cyl disp hp drat wt qsec vs am gear carb
0 17.14737 6.947368 290.3789 160.2632 3.286316 3.768895 18.18316 0.3684211 0 3.210526 2.736842
1 24.39231 5.076923 143.5308 126.8462 4.050000 2.411000 17.36000 0.5384615 1 4.384615 2.923077
Creating variables/vectors with simple algebraic operations is straightforward. Let's add a variable to the data frame Arthritis.
library(vcd)
Arthritis$new = log(Arthritis$Age)
head(Arthritis)
What if we wanted to create an indicator variable which takes the value 0 or 1 according to whether Age is less than 40? Simple algebraic operations will not work; we'll have to use conditional operators.
Arthritis$new = as.numeric(Arthritis$Age < 40)
head(Arthritis)
This seems trivial too. Now what if we want Age to be floored to 40 whenever it is less than 40, and otherwise kept as it is? We won't be able to achieve this with a simple conditional statement either.
x = sample(40, 10)
x
[1] 26 18 9 11 12 40 31 2 34 14
We will use the function ifelse to achieve this.
y = ifelse(x > 20, 20, x)
y
[1] 20 18 9 11 12 20 20 2 20 14
Now, let's use this to add a variable to the data frame.
Arthritis$new = ifelse(Arthritis$Age < 40, 40, Arthritis$Age)
head(Arthritis)
We have seen ways to modify and summarise data in base R, but those functionalities are scattered and not streamlined. If you think about it, you can achieve almost all kinds of modifications to data using just a handful of verbs: filter, select, arrange, mutate, and summarise (together with group_by).
The package dplyr provides exactly these verbs for data wrangling. Next, we'll see how to achieve different data wrangling tasks in base R, and then the same in dplyr. Of course dplyr comes with some additional functionality too, and we'll look at that as well.
Before we start, install packages dplyr and hflights.
library(dplyr)
library(hflights)
We’ll be using data set hflights. You can get details of the data hflights after you have loaded library hflights by typing ?hflights.
?hflights
Let's have a look at our data.
data(hflights)
head(hflights)
We'll start with the function tbl_df, which converts a data.frame to a tabular format that displays better on the console. It changes nothing else about the data frame.
flights = tbl_df(hflights)
flights
Let's look at conditional filtering of the data. We'll start with the base R approach to view all flights on January 1.
flights[flights$Month==1 & flights$DayofMonth==1, ]
In dplyr's filter, you can use a comma or an ampersand to represent an AND condition.
# note: you can use comma or ampersand to represent an AND condition
filter(flights, Month==1, DayofMonth==1)
You can use the | operator for an OR condition; here, for example, American (AA) or United (UA) flights.
# you can use | for an OR condition
filter(flights, UniqueCarrier=="AA" | UniqueCarrier=="UA")
You can also use the %in% operator.
# you can also use the %in% operator
filter(flights, UniqueCarrier %in% c("AA", "UA"))
See, you don't need to bother with the $ reference to the data frame all the time; the code is much neater and more readable.
Let's look at selecting and dropping columns by name. You'll definitely be surprised by the additional functionality dplyr offers.
# base R approach to select the DepTime, ArrTime, and FlightNum columns
flights[, c("DepTime", "ArrTime", "FlightNum")]
# dplyr approach
select(flights, DepTime, ArrTime, FlightNum)
Use a colon to select a range of contiguous columns, and use contains to match columns by name. Note: starts_with and ends_with can also be used to match columns by name.
select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))
Now, what if we wanted to do several operations at once, for example selection and conditional filtering? We can do so by nesting our functions.
# nesting method: select the UniqueCarrier and DepDelay columns and filter for delays over 60 minutes
filter(select(flights, UniqueCarrier, DepDelay), DepDelay > 60)
This nesting quickly becomes cumbersome, defeating the purpose we started with: making our code more readable.
To the rescue comes the %>% operator, also called the chaining operator. When you use it, each subsequent step takes as input the output of the previous one. You'll understand this better with the following example; later on, we'll rewrite the nested code above with the chaining operator.
x = sample(10, 6)
x %>% log() %>% sum()
[1] 8.014336
See, you don't have to pass inputs to those functions explicitly: x goes as input to log, and then log(x) goes as input to sum.
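Incidentally, the chaining idea is not unique to dplyr: since version 4.1, base R also has a native pipe operator |> that behaves the same way for simple calls like these.

```r
x = c(0.5, 1.5, 2.5)

# |> feeds x into log(), then feeds the result into sum()
piped  = x |> log() |> sum()
nested = sum(log(x))

all.equal(piped, nested)  # TRUE
```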
Let's see how we can use this to rewrite the nested call from above.
# chaining method
flights %>% select(UniqueCarrier, DepDelay) %>% filter(DepDelay > 60)
See, no need to nest calls or keep giving the data reference for every operation. Isn't that neat!
Next we move to ordering/sorting our data using the verb arrange.
# base R approach to select the UniqueCarrier and DepDelay columns and sort by DepDelay
flights[order(flights$DepDelay), c("UniqueCarrier", "DepDelay")]
# dplyr approach
flights %>% select(UniqueCarrier, DepDelay) %>% arrange(DepDelay)
The next step is mutate: modifying existing variables or adding new ones.
# base R approach to create a new variable Speed (in mph)
flights$Speed <- flights$Distance / flights$AirTime * 60
flights[, c("Distance", "AirTime", "Speed")]
# dplyr approach
flights %>% select(Distance, AirTime) %>% mutate(Speed = Distance/AirTime*60)
So far we have just been displaying the output; if you want to save it, you can do so as you usually would in R. Say we want to save the above output to a data frame.
flight_sub = flights %>% select(Distance, AirTime) %>% mutate(Speed = Distance/AirTime*60)
We are done with wrangling data without collapsing it. Next we look at exactly that: summarising data by groups, or collapsing data to its group-wise summaries, using dplyr.
# dplyr approach: create a table grouped by Dest, then summarise each group by taking the mean of ArrDelay
flights %>% group_by(Dest) %>% summarise(avg_delay = mean(ArrDelay, na.rm=TRUE))
This pretty much finishes our discussion of the dplyr verbs. Below are a few more examples that introduce useful functions we haven't met yet.
For each day of the year, count the total number of flights and sort in descending order
# for each day of the year, count the total number of flights and sort in descending order
z = flights %>% group_by(Month, DayofMonth) %>% summarise(flight_count = n()) %>% arrange(desc(flight_count))
Let's rewrite this more simply with the tally function.
# rewrite more simply with the tally function
flights %>% group_by(Month, DayofMonth) %>% tally(sort = TRUE)
For each destination, count the total number of flights and the number of distinct planes that flew there.
# for each destination, count the total number of flights and the number of distinct planes that flew there
flights %>% group_by(Dest) %>% summarise(flight_count = n(), plane_count = n_distinct(TailNum))
For each destination, show the number of cancelled and not cancelled flights.
# for each destination, show the number of cancelled and not cancelled flights
flights %>% group_by(Dest) %>% select(Cancelled) %>% table() %>% head()
Adding missing grouping variables: `Dest`
Cancelled
Dest 0 1
ABQ 2787 25
AEX 712 12
AGS 1 0
AMA 1265 32
ANC 125 0
ASE 120 5
For each month, calculate the number of flights and the change from the previous month.
# for each month, calculate the number of flights and the change from the previous month
flights %>% group_by(Month) %>% summarise(flight_count = n()) %>% mutate(change = flight_count - lag(flight_count))
Again, lets rewrite more simply with the tally function.
# rewrite more simply with the tally function
flights %>% group_by(Month) %>% tally() %>% mutate(change = n - lag(n))
Base R approach to view the structure of an object
# base R approach to view the structure of an object
str(flights)
Classes 'tbl_df', 'tbl' and 'data.frame': 227496 obs. of 22 variables:
$ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
$ Month : int 1 1 1 1 1 1 1 1 1 1 ...
$ DayofMonth : int 1 2 3 4 5 6 7 8 9 10 ...
$ DayOfWeek : int 6 7 1 2 3 4 5 6 7 1 ...
$ DepTime : int 1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
$ ArrTime : int 1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
$ UniqueCarrier : chr "AA" "AA" "AA" "AA" ...
$ FlightNum : int 428 428 428 428 428 428 428 428 428 428 ...
$ TailNum : chr "N576AA" "N557AA" "N541AA" "N403AA" ...
$ ActualElapsedTime: int 60 60 70 70 62 64 70 59 71 70 ...
$ AirTime : int 40 45 48 39 44 45 43 40 41 45 ...
$ ArrDelay : int -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
$ DepDelay : int 0 1 -8 3 5 -1 -1 -5 43 43 ...
$ Origin : chr "IAH" "IAH" "IAH" "IAH" ...
$ Dest : chr "DFW" "DFW" "DFW" "DFW" ...
$ Distance : int 224 224 224 224 224 224 224 224 224 224 ...
$ TaxiIn : int 7 6 5 9 9 6 12 7 8 6 ...
$ TaxiOut : int 13 9 17 22 9 13 15 12 22 19 ...
$ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
$ CancellationCode : chr "" "" "" "" ...
$ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
$ Speed : num 336 299 280 345 305 ...
The dplyr approach gives better formatting and adapts to your screen width.
glimpse(flights)
As we move slowly towards predictive modelling, you'll need to take random samples from your data for various purposes. You can achieve that rather simply using the function sample, which we have used a lot so far. Here is an example of taking a 70% random sample from the data frame mtcars.
# we are using set.seed so that our random sample is reproducible
set.seed(1)
s = sample(1:nrow(mtcars), 0.7*nrow(mtcars))
You can now use this vector s as row index vector to take sample data from the data frame.
mtcars_sample = mtcars[s, ]
How do I get the rest of the observations which are not in sample taken above?
mtcars_remaining = mtcars[-s, ]
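A quick sanity check that the two pieces really partition the data, with no row lost and none counted twice:

```r
set.seed(1)
s = sample(1:nrow(mtcars), 0.7*nrow(mtcars))

mtcars_sample    = mtcars[s, ]   # the 70% sample
mtcars_remaining = mtcars[-s, ]  # everything else

# together they cover all 32 rows exactly once
nrow(mtcars_sample) + nrow(mtcars_remaining) == nrow(mtcars)  # TRUE
```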
How do you bootstrap your data, i.e. resample it randomly with replacement? Again, you can achieve that with the function sample, using the option replace.
set.seed(1)
s = sample(1:nrow(mtcars), 100, replace = TRUE)
mtcars_bootstrapped = mtcars[s, ]
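To show what bootstrapping buys you, here is a hedged sketch (the 1000 resamples and the 95% level are arbitrary illustrative choices): resample mtcars many times and compute an empirical interval for the mean of mpg.

```r
set.seed(1)

# resample the rows with replacement and recompute the mean each time
boot_means = sapply(1:1000, function(i) {
  s = sample(1:nrow(mtcars), nrow(mtcars), replace = TRUE)
  mean(mtcars$mpg[s])
})

# empirical 95% interval for the mean of mpg
quantile(boot_means, c(0.025, 0.975))
```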